ABSTRACT

This document contains three papers from the
Methodology Project of the Center for the Study of Evaluation.
Methods for characterizing test accuracy are reported in the first
two papers. "Bounds on the K Out of N Reliability of a Test, and an
Exact Test for Hierarchically Related Items" describes and
illustrates how an extension of a latent structure model can be used
in conjunction with results in Sathe (1980) to estimate the upper and
lower bounds of the probability of making at least k correct
decisions. "An Approximation of the K Out of N Reliability of a
Test and a Scoring Procedure for Determining Which Items an Examinee
Knows" proposes a probability approximation that can be estimated
with an answer-until-correct (AUC) test. "How Do Examinees Behave
When Taking Multiple Choice Tests?" deals with empirical studies of
AUC assumptions. Over two hundred examinees were asked to record the
order in which they chose their responses. Findings indicate that
Horst's (1933) assumption that examinees eliminate as many
distractors as possible and guess at random from among those that
remain appears to be a tolerable approximation of reality in most
cases. (Author/PN)
ABSTRACT

Consider an n-item multiple choice test where it is decided that
an examinee knows the answer if and only if he/she gives the correct
response. The k out of n reliability of the test, $\rho_k$, is defined to
be the probability that for a randomly sampled examinee, at least k
correct decisions are made about whether the examinee knows the answer to
an item. The paper describes and illustrates how an extension of a
recently proposed latent structure model can be used in conjunction
with results in Sathe et al. (1980) to estimate upper and lower bounds
on $\rho_k$. A method of empirically checking the model is discussed.
Consider a randomly sampled examinee responding to a multiple-choice
test item. In mental test theory there are, of course, many
procedures that might be used to analyze this item. One approach might
be as follows. Suppose a conventional scoring procedure is used where
it is decided that an examinee knows the correct response if the correct
alternative is chosen, and that otherwise the examinee does not know.
If it were possible to estimate the probability, $\tau$, of correctly
determining an examinee's latent state (whether he/she knows the correct
response) based on the above decision rule, this would give an indication
of how well the distractors are performing for the typical examinee. The obvious
problem is that under normal circumstances, there is no way of estimating
this probability unless additional assumptions are made. One approach
is to assume that examinees guess at random among the alternatives when
they do not know the answer. If this knowledge or random guessing model
holds, $\tau$ is easily estimated. However, empirical investigations (Bliss,
1980; Cross & Frary, 1977) suggest that this assumption will frequently
be violated, and some related empirical results (Wilcox, 1982, in press a)
indicate that such a model can be entirely unsatisfactory for other reasons
as well.

Another approach is to use a latent structure model, and many such
models have been proposed for measuring achievement (e.g., Brownless &
Keats, 1958; Marks & Noll, 1967; Knapp, 1977; Dayton & Macready, 1976, 
1980; Macready & Dayton, 1977; Wilcox, 1977a, 1977b, 1981a; Bergan et al., 1 
The choice of a model depends on what one is willing to assume in a 
particular situation. These models make it possible to estimate errors 
at the item level such as

    $p = \Pr(\text{randomly selected examinee gives the correct response} \mid \text{examinee does not know})$    [1]

which in turn yields an estimate of $\tau$. An illustration is given in a
later section. (For a review of latent structure models vis-a-vis
criterion-referenced tests, see Macready and Dayton, 1980.) For some
recent general comments on using latent structure models to measure
achievement, see Molenaar (1981) and Wilcox (1981b).

Assume for a moment that for each item on an n-item test, an estimate
of $\tau$ can be made. Let $x_i = 1$ if a randomly selected examinee's latent state
is correctly determined for the ith item; otherwise $x_i = 0$. Then $E(x_i) = \tau_i$
($i = 1, \ldots, n$) is the probability of a correct decision on the ith item
where the expectation is taken over the population of examinees.

Within the framework just described, how should an n-item test be
characterized? Observing that $\sum x_i$ is the number of correct decisions
among the n items, an obvious approach is to use

    $\mu = E(\sum x_i) = \sum \tau_i$    [2]

where the expectation is over some particular population of examinees.
The parameter $\mu$ is just the expected number of correct decisions among
the n items for a typical examinee.

Knowing $\mu$ might not be important for certain types of tests, but
surely it is important for some achievement tests. However, even if
$\mu$ is known exactly, it would be helpful to have some additional related
information about $\sum x_i$. For instance, a test constructor would have a
better idea of how the test performs if $\mathrm{VAR}(\sum x_i)$ could be determined.
The problem is that $\mathrm{VAR}(\sum x_i)$ depends on $\mathrm{COV}(x_i, x_j)$, but this last quantity
is not known, and at present there is no way of estimating it. An
alternative approach is to use the k out of n reliability of the test
(Wilcox, in press a), which is given by

    $\rho_k = \Pr(\sum x_i \ge k)$.    [3]

In other words, if the goal of a test is to determine which of n items
an examinee knows, and if a conventional scoring procedure is used, $\rho_k$
is the probability of making at least k correct decisions for the typical
examinee.

Suppose, for example, n = 10 and $\mu$ is estimated to be 7. Thus, the
expected number of correct decisions is 7, but there is no information
about the likelihood that at least 7 correct decisions will be made.
If $\rho_k$ were known, a test constructor would have some additional and
useful information for judging the accuracy of the test. $\rho_k$ might also
be used as follows. Suppose it is desired to have $\rho_8 \ge .9$. If $\mu$ is
estimated to be 9.1, this is encouraging, but it is not clear what
implications this has in terms of making at least 8 correct decisions
for the typical examinee.

If $x_i$ is independent of $x_j$, $i \ne j$, an exact expression for $\rho_k$ is
available via the compound binomial distribution. Perhaps there are
situations where this independence might be assumed, but it is evident
that this independence will not always hold. If it can be assumed that
$\mathrm{COV}(x_i, x_j) \ge 0$, bounds on $\rho_k$ are available (Wilcox, in press). Recently
Sathe, Pradhan, and Shah (1980) derived bounds on $\rho_k$ that make no
assumption about $\mathrm{COV}(x_i, x_j)$. The main point of this paper is that these
bounds can be estimated using an extension of an answer-until-correct
(AUC) scoring procedure proposed by Wilcox (1981a).
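
As a point of reference for the independent case mentioned above, the compound binomial probability can be computed by a simple recursion over the items. The following is a minimal sketch, not the author's procedure; the function name `k_out_of_n_independent` and the argument `tau` (a list of per-item probabilities of a correct decision) are hypothetical names introduced here for illustration.

```python
def k_out_of_n_independent(tau, k):
    """Pr(sum of x_i >= k) when the x_i are independent Bernoulli(tau_i)."""
    # dist[s] holds the probability of exactly s correct decisions
    # among the items processed so far.
    dist = [1.0]
    for tau_i in tau:
        nxt = [0.0] * (len(dist) + 1)
        for s, pr in enumerate(dist):
            nxt[s] += pr * (1.0 - tau_i)   # incorrect decision on this item
            nxt[s + 1] += pr * tau_i       # correct decision on this item
        dist = nxt
    return sum(dist[k:])


# Ten items, each with a .9 chance of a correct decision, at least 8 correct.
rho_8 = k_out_of_n_independent([0.9] * 10, 8)
```

The recursion only applies when independence is assumed; it says nothing about the dependent case that the bounds in this paper are meant to address.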



An Extension of an Answer-Until-Correct Scoring Procedure

As just indicated, an extension of results in Wilcox (1981a) is
needed in order to apply the bounds derived by Sathe et al. (1980).
First, however, it is helpful to briefly review the procedure and basic
assumptions in Wilcox (1981a).

Consider a specific test item having t alternatives from which to 
choose, one of which is the correct response. Assume examinees respond 
according to an AUC scoring procedure. This means that examinees 
choose an alternative, and they are told immediately whether the correct
response has been identified. If they are incorrect another response
is chosen, and this process continues until they are successful. Special
forms are generally available for administering AUC tests which make 
these tests easy to use in the classroom. 

Let $\zeta_{t-1}$ be the proportion of examinees who know the correct
response, and let $\zeta_i$ ($i = 0, \ldots, t-2$) be the proportion of examinees
who can eliminate i distractors given that they do not know. Wilcox
(1981a) assumes that examinees eliminate as many distractors as they
can, and then choose at random from among those that remain. If $p_i$
is the probability of choosing the correct response on the ith attempt,
then

    $p_i = \sum_{j=0}^{t-i} \zeta_j/(t - j)$    ($i = 1, \ldots, t$).    [4]

Note that the model assumes that at least one effective distractor is 
being used. Put another way, no distinction is made between examinees 
who know the answer and examinees who can eliminate all of the distractors.
Also, the model assumes Pr(incorrect response | examinee knows) = 0. In
certain special cases this assumption can be avoided (e.g., Macready &
Dayton, 1977), and the results reported here are easily extended to this
case (cf. Molenaar, 1981; Wilcox, 1981b).
Assuming the model holds,

    $\zeta_{t-1} = p_1 - p_2$    [5]

and

    $\tau = \zeta_{t-1} + 1 - p_1 = 1 - p_2$.    [6]

If in a random sample of N examinees, $y_i$ examinees are correct on their
ith attempt, $\hat{p}_i = y_i/N$ is an unbiased estimate of $p_i$ which yields an
estimate of $\zeta_{t-1}$ and $\tau$.
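
The single-item estimates can be sketched as follows, using the relations $\zeta_{t-1} = p_1 - p_2$ and $\tau = 1 - p_2$ that follow from equation 4. This is a minimal illustration under the model's assumptions; the function name `auc_item_estimates` is hypothetical, and the counts used in the example are the Item 1 row totals of Table 1.

```python
def auc_item_estimates(counts):
    """Estimate p_i, the proportion who know, and tau for one AUC item.

    counts[i-1] is the number of examinees who were correct on attempt i.
    """
    n_examinees = sum(counts)
    p_hat = [y / n_examinees for y in counts]   # unbiased estimates of p_i
    zeta_know = p_hat[0] - p_hat[1]             # proportion who know: p_1 - p_2
    tau = 1.0 - p_hat[1]                        # Pr(correct decision): 1 - p_2
    return p_hat, zeta_know, tau


# Item 1 of Table 1: 219, 88 and 70 examinees needed 1, 2 and 3 attempts.
p_hat, zeta_know, tau = auc_item_estimates([219, 88, 70])
```

If the estimated $p_i$ are not in decreasing order, these estimates are not consistent with the model and would need to be adjusted first.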

Although empirical studies suggest that this model will frequently
be reasonable (Wilcox, 1982a, 1982b), there are instances where this
will not be the case. For example, some items might require a misinformation
model, and an appropriate modification of the AUC scoring procedure
has been proposed (Wilcox, 1982). The results outlined here are readily
extended to this case, and a brief outline of how this can be done is given
below.

Consider any two items on an n-item test, say items i and j.
Applying results in Sathe et al. requires an estimate of $\tau_{ij} = \Pr(x_i = 1, x_j = 1)$,
i.e., the joint probability of making a correct decision for both items
i and j. The remainder of this section outlines how this might be done.

It is assumed that an examinee's guessing rate is independent over
the items that he/she does not know. This means, for example, that if
an examinee can eliminate all but 2 alternatives on item i, and all but
3 alternatives on item j, the probability of choosing the correct response
on the first attempt of both items is (1/2)(1/3) = 1/6.

For the two items under consideration, let $p_{km}$ ($k, m = 1, \ldots, t$)
be the probability that a randomly selected examinee chooses the correct
response on the kth attempt of the first item, and the correct response
on the mth attempt of the second. If $\zeta_{gh}$ is the proportion of examinees
who can eliminate g distractors from the first item and h distractors
from the second ($g, h = 0, \ldots, t-1$), then

    $p_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t-m} \zeta_{ij}/[(t - i)(t - j)]$.    [7]

The last expression can be used to express $\zeta_{t-1,t-1}$ in terms of the $p_{km}$'s,
which can be used to estimate $\zeta_{t-1,t-1}$. Note that if the first item has t'
alternatives, $t' \ne t$, simply replace t-k with t'-k in equation 7.
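
To make the double sum in equation 7 concrete, the sketch below computes every $p_{km}$ from a table of joint proportions, for two items that both have t alternatives. The function name `joint_attempt_probs` and the uniform table of proportions are hypothetical and chosen only for illustration.

```python
def joint_attempt_probs(zeta, t):
    """Return p[k-1][m-1] = p_km computed from zeta[i][j] via equation 7.

    zeta[i][j] is the proportion of examinees who can eliminate i distractors
    on the first item and j on the second (index t-1 means "knows").
    """
    p = [[0.0] * t for _ in range(t)]
    for k in range(1, t + 1):
        for m in range(1, t + 1):
            p[k - 1][m - 1] = sum(
                zeta[i][j] / ((t - i) * (t - j))
                for i in range(t - k + 1)      # i = 0, ..., t-k
                for j in range(t - m + 1)      # j = 0, ..., t-m
            )
    return p


# Hypothetical example with t = 3: all nine joint states equally likely.
zeta = [[1.0 / 9.0] * 3 for _ in range(3)]
p = joint_attempt_probs(zeta, 3)
```

A useful check on any such table is that the $p_{km}$ sum to one whenever the joint proportions do, since each examinee is eventually correct on both items.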

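To clarify matters, consider the special case t = 3. Equation 7
says that

    $p_{11} = \zeta_{22} + \zeta_{21}/2 + \zeta_{20}/3 + \zeta_{12}/2 + \zeta_{11}/4 + \zeta_{10}/6 + \zeta_{02}/3 + \zeta_{01}/6 + \zeta_{00}/9$    [8]
    $p_{12} = \zeta_{21}/2 + \zeta_{20}/3 + \zeta_{11}/4 + \zeta_{10}/6 + \zeta_{01}/6 + \zeta_{00}/9$    [9]
    $p_{13} = \zeta_{20}/3 + \zeta_{10}/6 + \zeta_{00}/9$    [10]
    $p_{21} = \zeta_{12}/2 + \zeta_{02}/3 + \zeta_{11}/4 + \zeta_{01}/6 + \zeta_{10}/6 + \zeta_{00}/9$    [11]
    $p_{22} = \zeta_{11}/4 + \zeta_{10}/6 + \zeta_{01}/6 + \zeta_{00}/9$    [12]
    $p_{23} = \zeta_{10}/6 + \zeta_{00}/9$    [13]
    $p_{31} = \zeta_{02}/3 + \zeta_{01}/6 + \zeta_{00}/9$    [14]
    $p_{32} = \zeta_{01}/6 + \zeta_{00}/9$    [15]
    $p_{33} = \zeta_{00}/9$.    [16]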

Thus, starting with equation 16,

    $\zeta_{00} = 9 p_{33}$,    [17]

    $\zeta_{01} = 6(p_{32} - p_{33})$,    [18]

and eventually $\zeta_{22}$ can be expressed in terms of the $p_{km}$'s. Replacing
the $p_{km}$'s with their usual unbiased estimate yields an estimate of $\zeta_{22}$,
say $\hat{\zeta}_{22}$. But it can be seen that for the two items under consideration
(items i and j),

    $\tau_{ij} = \zeta_{22} + 1 - p_{11}$.    [19]

Replacing $\zeta_{22}$ and $p_{11}$ with $\hat{\zeta}_{22}$ and $\hat{p}_{11}$ yields an estimate of $\tau_{ij} = \Pr(x_i = 1, x_j = 1)$,
say $\hat{\tau}_{ij}$. For arbitrary t, $\tau_{ij}$ is given by equation 19 with $\zeta_{22}$ replaced
with $\zeta_{t-1,t-1}$. Note, however, that the model implies that certain inequalities
among the $p_{km}$'s must hold. For example, $p_{11} \ge p_{12} \ge p_{13}$. Estimating the
$p_{km}$'s assuming these inequalities are true requires an application of the
pool-adjacent-violators algorithm (Barlow et al., 1972). Testing these
inequalities can be accomplished by applying results in Robertson (1978).

Bounds on $\rho_k$

This section describes how the results in the previous section can
be used to estimate bounds on $\rho_k$. First, however, results in Sathe et al.
(1980) are summarized.

Recall that $\mu = \sum \tau_i$, and let

    $S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \tau_{ij}$,    [20]

    $U_k = \mu - k$,    [21]

and

    $V_k = (2S - k(k - 1))/2$.    [22]

Then,

    $\rho_k \ge \frac{2V_{k-1} - (k - 2)U_{k-1}}{n(n - k + 1)}$.    [23]

If $2V_{k-1} < (n + k - 2)U_{k-1}$, then

    $\rho_k \ge \frac{2((k^* - 1)U_{k-1} - V_{k-1})}{(k^* - k)(k^* - k + 1)}$    [24]

where $k^* + k - 3$ is the largest integer in $2V_{k-1}/U_{k-1}$. Two upper
bounds on $\rho_k$ are also given. The first is

    $\rho_k \le 1 + ((n + k - 1)U_k - 2V_k)/(kn)$    [25]

and the second is that if $2V_k < (k - 1)U_k$,

    $\rho_k \le 1 - \frac{2((k^* - 1)U_k - V_k)}{(k - k^*)(k - k^* + 1)}$    [26]

where $k^* + k - 1$ is the largest integer in $2V_k/U_k$.



An Illustration

To illustrate how $\rho_k$ might be applied and interpreted, observations
on seven items were analyzed according to the procedure outlined
above. Each item had two distractors, and they were found to be
consistent with the assumptions of the answer-until-correct scoring
model. (See Wilcox, 1981a.) Table 1 shows the observed frequencies
for the first two items. The question to be answered is if these
seven items are taken to be the whole test, do they give reasonably
accurate information about what the typical examinee knows?

As previously mentioned, the model described above implies that various
inequalities among the $p_{km}$'s must hold. These inequalities were tested at
the .25 level of significance with the procedure in Robertson (1978). In
every case the observed responses were consistent with the model.

Generally, when estimating $\zeta_{22}$ there is no need to estimate all
of the $\zeta$'s in equations 8-16. For the situation at hand, $\zeta_{22}$ can be
estimated as follows. First compute

    $\zeta_{02}/3 = p_{31} - p_{32}$;    [27]

for the data in Table 1, this is .107. Next compute

    $\zeta_{12}/2 = p_{21} - p_{22} - \zeta_{02}/3$    [28]

which is .074. Then

    $\zeta_{22} = p_{11} - p_{12} - \zeta_{12}/2 - \zeta_{02}/3$    [29]

which is equal to .225. Substituting these values into equation 19,
the estimate of $\tau_{12}$ is $\hat{\tau}_{12} = .751$. Applying equation 6 to all seven items,
it is seen that $\hat{\mu} = 5.434$. In other words, it is estimated that the
expected number of correct decisions is 5.434.
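
The three steps above can be retraced from the Table 1 frequencies. The sketch below assumes that the rows of Table 1 index attempts on Item 1 and the columns index attempts on Item 2, and it reads equation 19 as $\tau_{12} = \zeta_{22} + 1 - p_{11}$; both readings are assumptions made here because they agree with the values reported in the text.

```python
# Table 1 frequencies: rows = attempts on Item 1, columns = attempts on Item 2
# (assumed orientation).
counts = [
    [179, 26, 14],
    [76, 8, 4],
    [53, 13, 4],
]
N = sum(sum(row) for row in counts)             # total number of examinees
p = [[c / N for c in row] for row in counts]    # p[k-1][m-1] estimates p_km

zeta02_over_3 = p[2][0] - p[2][1]                             # p_31 - p_32
zeta12_over_2 = p[1][0] - p[1][1] - zeta02_over_3             # p_21 - p_22 - zeta_02/3
zeta22 = p[0][0] - p[0][1] - zeta12_over_2 - zeta02_over_3    # proportion who know both
tau12 = zeta22 + 1.0 - p[0][0]                                # assumed form of equation 19
```

Carried at full precision, the intermediate quantities agree with the reported values to within rounding in the third decimal place.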

Next consider $\rho_5$. The value of S was estimated to be 16.929.
From equations 20 - 25, this implies that

    $.42 \le \rho_5 \le .74$.    [30]
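
The upper limit in [30] can be checked with a few lines of code. The sketch below evaluates only the first upper bound (equation 25) and assumes that equation 21, which is not legible in the source, defines $U_k = \mu - k$; the function name `upper_bound_eq25` is hypothetical.

```python
def upper_bound_eq25(n, k, mu, s):
    """First upper bound on rho_k from Sathe et al. (1980), as read here."""
    u_k = mu - k                           # assumed reading of equation 21
    v_k = (2.0 * s - k * (k - 1)) / 2.0    # equation 22
    return 1.0 + ((n + k - 1) * u_k - 2.0 * v_k) / (k * n)


# Seven items, k = 5, with the estimates reported in the text.
print(round(upper_bound_eq25(n=7, k=5, mu=5.434, s=16.929), 2))  # prints 0.74
```

That this reading of $U_k$ reproduces the reported .74 is the only support offered for it here; the lower bounds are not evaluated in this sketch.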

This analysis suggests that these seven items, taken as a whole,
are not very accurate since there is at least a 26 percent chance of
making an incorrect decision on three or more items. How should the
test be modified? Another important question is to what extent can
it be improved? One approach to improving the test is to increase the
number of distractors, and another approach is to try to modify or
replace the distractors that are being used. The latter approach will
be considered first.

The initial step in trying to decide whether to replace or modify
the existing distractors is to determine the extent to which they can
be improved. This can be done with the $\Delta$ measure in Wilcox (1981, eq. 20).
This measure is just the difference between the maximum possible value
of $\tau$ and the estimated value given that $\zeta_2 = \hat{\zeta}_2$. Another related
measure is the entropy function (see Wilcox, 1981a). This measures
the effectiveness of the distractors among the examinees who do not know
the correct response by indicating the extent to which $p_2, \ldots, p_t$ are
unequal. The closer they are to being equal, the more effective are
the distractors, i.e., guessing is closer to being random. It has been
pointed out (Wilcox, 1981a) that $\Delta$ might be objectionable as a
measure of the extent to which $p_2, \ldots, p_t$ are equal, but for present purposes
it would seem to be of interest because increasing $\rho_k$ depends on
the extent to which $\tau$ can be increased for each item.

Referring to Wilcox (1981a), a little algebra shows that for the
case t = 3,

    $\Delta = (p_2 - p_3)/2$.    [31]

For item 1 in Table 1, $\Delta = .024$, and for item 2 it is .034. ($\Delta$ is assumed
to be positive; so if $p_2 < p_3$, apply the pool-adjacent-violators algorithm,
in which case $\Delta$ is estimated to be zero.)
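
For t = 3 the measure reduces to a one-line computation. The sketch below uses the marginal totals of Table 1 and sets the estimate to zero when $p_2 < p_3$, which is what pooling the two adjacent proportions amounts to in this case; the function name `delta_t3` is hypothetical.

```python
def delta_t3(counts):
    """Estimate Delta = (p_2 - p_3)/2 for a three-alternative AUC item.

    counts holds the numbers of examinees correct on attempts 1, 2 and 3.
    """
    n_examinees = sum(counts)
    p2 = counts[1] / n_examinees
    p3 = counts[2] / n_examinees
    # If p_2 < p_3 the pooled estimates are equal, so Delta is taken to be zero.
    return max(0.0, (p2 - p3) / 2.0)


delta_item1 = delta_t3([219, 88, 70])   # Item 1 row totals of Table 1
delta_item2 = delta_t3([308, 47, 22])   # Item 2 column totals of Table 1
```

At full precision these come to roughly .024 and .033; the small difference from the reported .034 for item 2 presumably reflects rounding of the intermediate proportions.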

If the number of alternatives for item 1 is increased to t = 5,
and if guessing is at random, then the value of $\tau$ would be .893 which
represents an increase of .126 over the value of $\tau$ using the existing
distractors. Thus, it would seem that one approach to improving
item 1 is to find two more distractors that are about as effective as the
two being used. Of course in practice, this might be very difficult
to do.

Estimating $\tau_{ij}$ When There Is Misinformation

Among the 30 items analyzed by Wilcox (in press, a), the observed
test scores suggest that two of the items do not conform well to the
AUC scoring model described in a previous section. Thus, the proposed
estimate of $\tau_{ij}$ is inappropriate. This section outlines how this
problem might be solved when a misinformation model appears to be more
appropriate for some of the items on the test.

Consider a test item with t alternatives, and let $\zeta_t$ be the
proportion of examinees who eliminate the correct response from consideration
on their first attempt of the item. (An AUC scoring procedure is being
assumed.) Once an examinee eliminates all of the distractors that are
consistent with his/her misinformation, it is assumed that the examinee
chooses the correct response on the next attempt. This assumption is
made here because it seems to give a good approximation to how examinees
were behaving on the items used in Wilcox (in press a). It is also
assumed that if an examinee does not know and does not have misinformation,
then he/she guesses at random among the t alternatives. Finally, for
examinees with misinformation, assume that they believe the correct
response is one of c alternatives that are in actuality incorrect.
Thus, examinees with misinformation will require at least c + 1 attempts
before getting the item correct. As an illustration, consider t = 5
and c = 3. Then,

    $p_1 = \zeta_{t-1} + \zeta_{t+1}/5$,    [32]

    $p_2 = \zeta_{t+1}/5$,    [33]

    $p_3 = \zeta_{t+1}/5$,    [34]

    $p_4 = \zeta_t + \zeta_{t+1}/5$,    [35]

    $p_5 = \zeta_{t+1}/5$,    [36]

where $\zeta_{t+1}$ is the proportion of examinees who do not know and who do
not have misinformation.
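
The attempt probabilities implied by this misinformation model can be written out for any t and c. The sketch below is a generalization made here for illustration, not a formula stated in the text: it assumes the three categories (know, misinformation, random guessing) are exhaustive and that an examinee with misinformation is correct on exactly attempt c + 1. The function name and the example proportions are hypothetical.

```python
def misinformation_attempt_probs(zeta_know, zeta_misinfo, t, c):
    """Return [p_1, ..., p_t] under the misinformation model sketched above."""
    # Remaining examinees neither know nor have misinformation; they guess
    # at random, so each attempt is equally likely to be the successful one.
    zeta_guess = 1.0 - zeta_know - zeta_misinfo
    p = [zeta_guess / t] * t
    p[0] += zeta_know        # examinees who know are correct on attempt 1
    p[c] += zeta_misinfo     # examinees with misinformation: attempt c + 1
    return p


# Hypothetical proportions with t = 5 alternatives and c = 3.
p_attempts = misinformation_attempt_probs(0.4, 0.1, t=5, c=3)
```

With t = 5 and c = 3 the pattern of the result matches the illustration in the text: only the first and fourth attempts carry more than the random-guessing share.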

Various modifications of the model are, of course, possible and 
presumably this model (with some appropriately chosen c value) will give 
a good fit to the observed test scores. For illustrative purposes,
equations 32 - 35 are assumed. The point of this section is that
it is now possible to again estimate $\tau_{ij}$ where the misinformation model
is assumed to hold for one or both of the items in any item pair. Note 
that for a single item where equations 32 - 35 hold, 



T = £ 
To estimate $\tau_{ij}$, the joint probability of making a correct decision
on a pair of items where, say, the first item is represented by a
misinformation model, equation 7 must be rederived. Accordingly, let t'
be the number of alternatives on the first item, and let t be the number of
alternatives on the second. The misinformation model assumes that on
the first attempt of the item, examinees belong to one of three mutually
exclusive categories, namely, they know the answer and choose it,
they have misinformation and eliminate the correct response, or they
do not know and guess at random. Thus, using previously established
notation, equation 8 becomes

    $p_{11} = \zeta_{42} + \zeta_{41}/2 + \zeta_{40}/3 + \zeta_{02}/t' + \zeta_{01}/(2t') + \zeta_{00}/(3t')$

where, in this illustration, t' = 5. There is no $\zeta_{ij}$ term with i = 1, 2, 3
because the misinformation model assumes that if examinees do not know,
they cannot eliminate any of the distractors. More generally,

    $p_{11} = \zeta_{t'-1,t-1} + \sum_{j=0}^{t-2} \zeta_{t'-1,j}/(t - j) + \sum_{j=0}^{t-1} \zeta_{0j}/[(t - j)t']$.

Also

    $p_{12} = \zeta_{41}/2 + \zeta_{40}/3 + \zeta_{01}/(2t') + \zeta_{00}/(3t')$

and

    $p_{1,t-m} = \sum_{j=0}^{m} \zeta_{t'-1,j}/(t - j) + \sum_{j=0}^{m} \zeta_{0j}/[(t - j)t']$    ($m = 0, \ldots, t-2$).

The remaining $p_{km}$ values can be determined in a similar manner. For the
two items being used here

    $p_{2m} = \sum_{j=0}^{t-m} \zeta_{0j}/[(t - j)t']$    ($m = 1, \ldots, t$)

and $p_{3m} = p_{2m}$.

The expressions for $p_{4m}$ and $p_{5m}$ involve the proportion of examinees who
have misinformation on the first item. The necessary equations can be derived
as was illustrated above. This in turn yields an estimate of $\tau_{ij}$ which can be
used to estimate the bounds on $\rho_k$.
Testing Whether Items are Equivalent or Hierarchically Related

The model described in this paper might also be useful when
empirically checking the assumptions of other latent structure models.
For example, Macready and Dayton (1977) and Wilcox (1977) propose models
where it is assumed that pairs of equivalent items are available. Two
items are defined to be equivalent if examinees either know both or neither
one. When equivalent items are available, the proportion of examinees
who know both can be estimated (assuming local independence). Macready
and Dayton checked their model with a chi-square goodness-of-fit test, but
this requires at least three items that are equivalent to one another.
(When there are only two items, there are no degrees of freedom left.)

For illustrative purposes, assume t = 3, and consider equations 8-16.
If two items are equivalent, then

    $\zeta_{21} = \zeta_{20} = \zeta_{12} = \zeta_{02} = 0$,

    $p_{12} = p_{21} = p_{22}$,

    $p_{13} = p_{23}$,

and

    $p_{31} = p_{32}$.

For N < 50, an exact test of these last three equalities can be made using the critical
values in Katti (1973) and Smith et al. (1979). (Note that the conditional
distribution of multinomial random variables is multinomial.) For larger N,
the usual chi-square test can be used. From Smith et al. (1979), a slight
adjustment to the usual chi-square test appears to be useful. Finally, if
one of these items is assumed to be hierarchically related to the other,
again certain equalities must hold among equations 8-16, and this can again
be tested (cf. White and Clark, 1973; Dayton and Macready, 1976).

A Concluding Remark 

It should be stressed that $\rho_k$ is of interest after it has been decided
which items are to be included on a test. $\rho_k$ is not intended to measure
validity; it is designed to measure the overall effectiveness of the
distractors that are being used. Put another way, $\rho_k$ is not meant to be the
one and only index for characterizing a test; it is intended to be one of
several indices that might be used. The reason for raising this issue is
that a test constructor can ensure that $\rho_k$ is large by using easy items.
This is an improper procedure that misses the point of how $\rho_k$ is to be used.
Table 1

Number of Examinees Requiring i Attempts on Item 1
and j Attempts on Item 2

Number of               Number of Attempts on Item 2
Attempts on
Item 1                  1        2        3      Total

1                     179       26       14        219
2                      76        8        4         88
3                      53       13        4         70

Total                 308       47       22        377
ABSTRACT

Consider any scoring procedure for determining whether an examinee
knows the answer to a test item. Let $x_i = 1$ if a correct decision is made
about whether the examinee knows the ith item; otherwise $x_i = 0$. The k out
of n reliability of a test is $\rho_k = \Pr(\sum x_i \ge k)$. That is, $\rho_k$ is the
probability of making at least k correct decisions for a typical (randomly sampled)
examinee. This paper proposes an approximation of $\rho_k$ that can be estimated
with an answer-until-correct test. The paper also suggests a scoring
procedure that might be used when $\rho_k$ is judged to be too small under a
conventional scoring rule where it is decided an examinee knows if and
only if the correct response is given.



Consider an n-item multiple-choice test, and suppose that every
examinee can be described as either knowing or not knowing the correct
response. In some situations, particularly with respect to some instructional
program, the goal of a test might be to determine how many of the
n items an examinee actually knows; in terms of diagnosis, it may even
be desirable to determine which specific items an examinee knows or
does not know. Under a conventional scoring procedure, about the only
scoring rule available is one where it is decided that an examinee knows
if and only if a correct response is given. Obviously guessing will
affect the accuracy of this rule. If it is assumed that examinees who
know will always give the correct response, and if most examinees really
do know the correct response, then of course guessing has little impact
on the accuracy of the test or the effectiveness of the distractors in
terms of the typical examinee. However, if $\zeta$ is the proportion of examinees
who know the answer to an item, then as $\zeta$ decreases, the importance
of having effective distractors increases in order to avoid incorrect
decisions about whether an examinee knows.

Guessing can seriously affect various other measurement problems 
as well (e.g., Weitzman, 1970; van den Brink and Koele, 1980; Wilcox, 1980,
1982c; Ashler, 1979). For example, when estimating the biserial correlation
coefficient, guessing can substantially affect the results (Ashler,
1979). Ashler gives a method of correcting the estimate for the effects
of guessing, but it requires a procedure for determining which items an
examinee really knows. The conventional rule is to decide an examinee knows 
if and only if the correct response is given, but this can be unsatisfactory. 
Suppose, for example, $\zeta = .5$, and the probability of a correct response,
given that the examinee does not know, is 1/3. Then 1/6 of the examinees
would be misclassified. The extreme case is where none of the
examinees know, in which case 1/3 would be incorrectly judged as knowing
the correct response.
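
The arithmetic behind the first figure is short enough to verify directly. The sketch below assumes, as in the text, that examinees who know always answer correctly, so the only misclassified examinees are those who do not know and guess correctly; the function name `misclassified_fraction` is hypothetical.

```python
from fractions import Fraction


def misclassified_fraction(zeta, guess_rate):
    """Proportion of examinees wrongly judged to know the answer."""
    # Only examinees who do not know (1 - zeta) can be misclassified, and
    # only when their first response happens to be correct.
    return (1 - zeta) * guess_rate


print(misclassified_fraction(Fraction(1, 2), Fraction(1, 3)))  # prints 1/6
```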

As another example, suppose an investigator wants to determine 
whether the proportion of examinees who know an item is relatively large. 
In order to ensure a reasonably high probability of a correct decision 
about this proportion, it follows from Wilcox (1980) that it might be
necessary to sample ten, perhaps even forty times as many examinees as 
would be required if guessing did not exist. 

For a specific examinee taking a test, let $x_i = 1$ if a correct decision
is made about whether the answer to the ith item is known; otherwise
$x_i = 0$. For an examinee randomly sampled from the population of potential
examinees, let

    $\rho_k = \Pr(\sum x_i \ge k)$.

This is just the probability of making at least k correct decisions among
the n items for a randomly sampled examinee; $\rho_k$ is called the k out of n
reliability of a test.

Suppose every item has t alternatives. One approach to designing
a reasonably accurate test is to assume random guessing, and then choose
t so that $\rho_k$ is reasonably close to one. If $x_i$ is independent of $x_j$ for
all $i \ne j$, then $\rho_k$ is easily calculated on a computer. Unfortunately,
there are three serious problems with this approach. First, there is
considerable empirical evidence that guessing is seldom at random
(Coombs et al., 1956; Bliss, 1980; Cross & Frary, 1977; Wilcox, 1982a,
1982b). Second, even if guessing is at random, some situations will require
more alternatives than is practical in order for $\rho_k$ to be close to
one (Wilcox, 1982c). Finally, there is no particular reason for assuming
$x_i$ independent of $x_j$, or to believe that such an assumption will
give a good approximation of $\rho_k$. If $\mathrm{cov}(x_i, x_j) \ge 0$, bounds on $\rho_k$ are available
(Wilcox, 1982c, in press a), but point estimates do not exist.

One goal in this paper is to suggest an approximation of $\rho_k$ that can
be estimated with an answer-until-correct test. Another and perhaps more
important goal is to describe a scoring procedure that might be used when
the estimate of $\rho_k$ is judged to be too small under a conventional scoring
rule. The new rule is based on a recently proposed latent structure model 
for test items. Included are some results on how to test whether this 
model is consistent with observed test scores. 

2. An Approximation of $\rho_k$

Let $y = (y_1, \ldots, y_n)$ be any vector of length n where $y_i = 0$ or 1, and
let f(y) be the probability density function of y. Bahadur (1961) shows
that f(y) can be written as

    $f(y) = f_1(y) h(y)$

where

    $f_1(y) = \prod_{i=1}^{n} \alpha_i^{y_i} (1 - \alpha_i)^{1 - y_i}$
    $\alpha_i = \Pr(y_i = 1)$
    $h(y) = 1 + \sum_{i<j} r_{ij} z_i z_j + \sum_{i<j<m} r_{ijm} z_i z_j z_m + \cdots + r_{12 \ldots n} z_1 z_2 \cdots z_n$
    $z_i = (y_i - \alpha_i)/[\alpha_i(1 - \alpha_i)]^{1/2}$
    $r_{ij} = E(z_i z_j)$
    $r_{ijm} = E(z_i z_j z_m)$
    ...
    $r_{12 \ldots n} = E(z_1 z_2 \cdots z_n)$.

An mth order Bahadur approximation of f is one where the first m summations
are used in the expression for h. Several authors have used a second
order approximation when investigating problems in discrete discriminant
analysis (e.g., Dillon & Goldstein, 1978; Gilbert, 1968; Moore, 1973).
In this case f(y) is approximated with

    $g(y) = f_1(y)[1 + \sum_{i<j} r_{ij} z_i z_j]$.    (2.1)

Other approximations have been proposed, but as will become evident,
(2.1) is particularly convenient for the situation at hand.

Occasionally (2.1) will not be a probability function. In particular, 
it may be that g(y)<0 for some vectors y. In this paper, whenever this 
occurred, g(y) was assumed to be zero, but the g(y) values were not re- 
scaled so that they sum to one. 

Bahadur (1961) discusses how to assess the goodness of fit of the
approximation. Here, however, interest is in approximating $\rho_k$. Note
that for a random vector y, $\rho_k$ can be written as

    $\sum_{y: S \ge k} f(y)$    (2.2)

where $S = \sum y_i$ and the summation in (2.2) is over all vectors y such that $S \ge k$.
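
The quantities in (2.1) and (2.2) can be sketched in code as follows. This is a minimal illustration that assumes the full probability function f(y) is available as a dictionary and that every marginal $\alpha_i$ lies strictly between 0 and 1; the function names `second_order_bahadur` and `rho` are hypothetical. Negative values of g(y) are set to zero without rescaling, as described above.

```python
import itertools
import math


def second_order_bahadur(pmf, n):
    """Return g(y) of (2.1) for every 0/1 vector y, given the full pmf f(y).

    pmf maps tuples y in {0,1}^n to probabilities; missing tuples count as 0.
    """
    alpha = [sum(pr for y, pr in pmf.items() if y[i] == 1) for i in range(n)]
    sd = [math.sqrt(a * (1.0 - a)) for a in alpha]

    def z(y, i):
        return (y[i] - alpha[i]) / sd[i]

    pairs = list(itertools.combinations(range(n), 2))
    # r_ij = E(z_i z_j), the pairwise correlations
    r = {(i, j): sum(pr * z(y, i) * z(y, j) for y, pr in pmf.items())
         for i, j in pairs}

    g = {}
    for y in itertools.product((0, 1), repeat=n):
        f1 = math.prod(alpha[i] if y[i] else 1.0 - alpha[i] for i in range(n))
        h = 1.0 + sum(r[i, j] * z(y, i) * z(y, j) for i, j in pairs)
        g[y] = max(0.0, f1 * h)   # negative values set to zero, not rescaled
    return g


def rho(prob, k):
    """Sum prob(y) over all y with at least k ones, as in (2.2)."""
    return sum(pr for y, pr in prob.items() if sum(y) >= k)
```

For n = 2 the expansion has no terms beyond the pairwise one, so g(y) reproduces f(y) exactly; this gives a convenient check that the pieces are wired together correctly before moving to larger n.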



Of course, when approximating $p_k$, f(y) would be replaced by f(x), where 
the vector x indicates the items for which a correct decision is made about 
whether an examinee knows. To gain some insight into how well g(y) 
approximates $p_k$, assuming $\alpha_i$ and $r_{ij}$ are known, we set n=5, k=4 and 
randomly chose values for the $2^5=32$ probability cells. Next, $p_k$ was 
evaluated with (2.2), and then it was approximated with $\tilde{p}_k$, where $\tilde{p}_k$ is 
given by (2.2) with f(y) replaced by g(y). This process was repeated 100 times, 
yielding a wide range of values for $p_k$. The values for $p_k$ and $\tilde{p}_k$ were 
rounded to the second decimal place, after which it was found that 85% 
of the time, $|\tilde{p}_k - p_k| \le .02$. For 5% of the approximations it was found 
that $|\tilde{p}_k - p_k| \ge .05$. For $|\tilde{p}_k - p_k| \ge .05$ it was also found that 
$\tilde{p}_k < p_k$. The poorest approximation was for a probability function where 
$p_k=.365$ and $\tilde{p}_k=.232$. Although hardly conclusive, these results suggest 
that $\tilde{p}_k$ is generally useful when approximating $p_k$, at least when n is small. 
For n large the test can be broken into subtests containing five items or fewer, 
and Bonferroni's inequality (e.g., Tong, 1980) can be applied. For example, 
suppose n=10. If for the first five items $p_4=.98$, and for the remaining five 
items $p_4=.95$, then for the entire test it is estimated that 

$$p_8 \ge 1-(1-.98)-(1-.95)=.93. \qquad (2.3)$$
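
The check just described can be sketched in a few lines of Python. This is only an illustration, not the program used for the results reported here. It assumes the usual Bahadur definitions $\alpha_i=\Pr(y_i=1)$, $z_i=(y_i-\alpha_i)/\sqrt{\alpha_i(1-\alpha_i)}$ and $r_{ij}=E(z_i z_j)$, takes $f_1$ to be the independence distribution with the same $\alpha_i$, and draws the cell probabilities from a flat Dirichlet distribution (the text says only that they were chosen at random). The function names are hypothetical.

```python
import itertools
import math
import random


def exact_and_approx(f, n, k):
    """Return (p_k, approximation of p_k) for cell probabilities f.

    f maps each binary n-tuple y to its probability f(y).
    """
    ys = list(f)
    # alpha_i = Pr(y_i = 1)
    alpha = [sum(f[y] for y in ys if y[i] == 1) for i in range(n)]
    sd = [math.sqrt(a * (1.0 - a)) for a in alpha]

    def z(y):
        return [(y[i] - alpha[i]) / sd[i] for i in range(n)]

    # r_ij = E(z_i z_j), computed from the full distribution
    r = {}
    for i, j in itertools.combinations(range(n), 2):
        r[i, j] = sum(f[y] * z(y)[i] * z(y)[j] for y in ys)

    exact = approx = 0.0
    for y in ys:
        if sum(y) < k:
            continue
        zy = z(y)
        f1 = math.prod(alpha[i] if y[i] else 1.0 - alpha[i] for i in range(n))
        g = f1 * (1.0 + sum(r[i, j] * zy[i] * zy[j] for (i, j) in r))
        exact += f[y]
        approx += max(g, 0.0)  # negative g(y) set to zero, no rescaling
    return exact, approx


def random_cells(n, rng):
    """Random cell probabilities (flat Dirichlet; an assumption)."""
    ys = list(itertools.product((0, 1), repeat=n))
    w = [rng.expovariate(1.0) for _ in ys]
    total = sum(w)
    return {y: wi / total for y, wi in zip(ys, w)}


rng = random.Random(0)
errors = []
for _ in range(100):
    p_k, p_k_approx = exact_and_approx(random_cells(5, rng), n=5, k=4)
    errors.append(abs(p_k_approx - p_k))
print(max(errors))
```

For n=2 the second order expression reproduces f(y) exactly, which gives a convenient check on the code.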

Estimating $p_k$ 

There remains the problem of estimating $p_k$. What is needed is an 
estimate of the parameters in the expression for g(y). An estimate 
is available using a slight extension of the model in Wilcox (in press a), 
which can be briefly summarized as follows. Assume that examinees take 
the test according to an answer-until-correct scoring procedure. That is, 
they choose a response, and if it is wrong they choose another. This 
process continues until the correct response is selected. Administering 
such tests is easily accomplished with specially designed answer sheets 
that are available commercially. 

Consider a specific item and let $p_i$ be the probability that a randomly 
selected examinee gets the item correct on the ith attempt, 
i=1,...,t, where t is the number of alternatives. Let $\zeta_i$ be the proportion 
of examinees who can eliminate i distractors (i=0,...,t-1). It is 
assumed that for examinees who do not know, there is at least one effective 
distractor, in which case $\zeta_{t-1}$ is the proportion of examinees who 
know. It is also assumed that once examinees eliminate as many distractors 
as they can, they guess at random from among those alternatives that 
remain. It follows that 

$$p_i = \sum_{j=0}^{t-i} \zeta_j/(t-j) \quad (i=1,\ldots,t) \qquad (2.4)$$

and the model implies that 

$$p_1 \ge p_2 \ge \cdots \ge p_t \qquad (2.5)$$

which can be tested (Robertson, 1978). For empirical results in support 
of this model, see Wilcox (1982a, 1982b, in press b). In the few instances 
where (2.5) seems to be unreasonable, a misinformation model appears to 
explain the observed test scores. When (2.5) is assumed, the pool-adjacent-violators 
algorithm (Barlow et al., 1972) yields a maximum likelihood estimate 
of the $p_i$'s. These estimates in turn yield an estimate of the $\zeta_i$'s. 
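
As a rough illustration of these last two steps, the Python sketch below pools adjacent violators in a set of observed attempt proportions and then inverts (2.4), using $\zeta_{t-i}=i(p_i-p_{i+1})$ with $p_{t+1}=0$. It assumes equal weights when pooling, which is the form usually given for ordered multinomial proportions. The function names are hypothetical and the input proportions are made up.

```python
def pava_nonincreasing(props):
    """Pool adjacent violators so that the result is non-increasing."""
    blocks = []  # each block holds [sum of pooled values, number pooled]
    for value in props:
        blocks.append([value, 1])
        # Pool while an earlier block mean is below the later one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def zeta_from_p(p):
    """Invert (2.4): zeta[j] is the proportion eliminating j distractors."""
    t = len(p)
    padded = list(p) + [0.0]
    zeta = [0.0] * t
    for i in range(1, t + 1):
        zeta[t - i] = i * (padded[i - 1] - padded[i])
    return zeta


observed = [0.5, 0.1, 0.3, 0.1]   # hypothetical proportions violating (2.5)
p_hat = pava_nonincreasing(observed)
zeta_hat = zeta_from_p(p_hat)
print(p_hat, zeta_hat)
```

Here the second and third proportions are pooled to their average, and the recovered $\zeta_i$ values are nonnegative and sum to one.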

For any pair of items, let $p_{ij}$ be the probability of a correct response on the 
ith attempt of the first and the jth attempt of the second, respectively, 
and let $\zeta_{ij}$ be the probability that a randomly chosen examinee can eliminate 
i distractors from the first, and j distractors from the second. Then 
$\zeta_{t-1,t-1}$ is the proportion of examinees who know both. It is assumed 
that an examinee's guessing rate is independent over the items not known, 
and so 

$$p_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t-m} \zeta_{ij}/[(t-i)(t-j)]. \qquad (2.6)$$

If the second item has t' alternatives, t≠t', simply replace t with t' in 
the second summation and in the factor (t-j). Testing certain implications 
of (2.6) is discussed below. 

For the ith item on the test, let $\tau_i=E(x_i)$ be the probability of a correct 
decision about whether the examinee knows when a conventional scoring 
procedure is used. Thus, $\tau_i$ plays the role of $\alpha_i$ when approximating $p_k$. For 
an answer-until-correct test, a conventional rule means to decide an examinee 
knows if and only if the correct response is given on the first 
attempt. In this case (Wilcox, 1982a) 

$$\tau_i = \zeta_{t-1} + 1 - p_1 = 1 - p_2.$$

Thus, if for the ith item, $x_{ij}$ of N examinees get the correct response on 
the jth attempt under an answer-until-correct scoring procedure, then 

$$\hat{\tau}_i = 1 - x_{i2}/N$$

is an estimate of $\tau_i$. If the observed proportions are inconsistent with (2.5), apply 
the pool-adjacent-violators algorithm (Barlow et al., 1972, pp. 13-16), 
as was previously mentioned. 

In a similar manner, let $\tau_{ij}=\Pr(x_i=1, x_j=1)$, i.e., $\tau_{ij}$ is the probability 
of a correct decision for both items i and j. For the conventional 
decision rule under an answer-until-correct model, it can be seen that 

$$\tau_{ij} = \sum_{k=1}^{t} \sum_{m=1}^{t} q_{km}$$

where 

$$q_{11} = \zeta_{t-1,t-1}$$

$$q_{i1} = \sum_{k=0}^{t-i} \zeta_{k,t-1}/(t-k) \quad (i=2,\ldots,t)$$

$$q_{1j} = \sum_{k=0}^{t-j} \zeta_{t-1,k}/(t-k) \quad (j=2,\ldots,t)$$

$$q_{ij} = p_{ij} \quad (i>1 \text{ and } j>1)$$

(Wilcox, in press a). 

Thus, $r_{ij}$, $z_i$ and $z_j$ in equation (2.1) are easily determined. 
In particular, 

$$r_{ij} = \frac{\tau_{ij} - \tau_i \tau_j}{[\tau_i \tau_j (1-\tau_i)(1-\tau_j)]^{1/2}}$$

where $\tau_i$ plays the role of $\alpha_i$ in the definition of $z_i$. But as noted in 
Wilcox (in press), the $\zeta_{ij}$'s in equation (2.6) are easily estimated, and 
these estimates yield an estimate of $\tau_{ij}$, which in turn gives an estimate 
of $r_{ij}$. Hence, the approximation of $p_k$ given by equation (2.1) can be 
estimated. 
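
The quantities in this section can be computed directly once values of the $\zeta_{ij}$ are available. The Python sketch below is an illustration under the model assumptions stated above, written for items with t and t' alternatives. It evaluates (2.6), sums the $q_{km}$ to obtain $\tau_{ij}$, and forms $r_{ij}$. The function names and the example values of $\zeta_{ij}$ are hypothetical.

```python
import math


def attempt_probs(zeta):
    """Equation (2.6): p[k-1][m-1] from zeta[i][j] (i, j = distractors eliminated)."""
    t, t2 = len(zeta), len(zeta[0])
    return [[sum(zeta[i][j] / ((t - i) * (t2 - j))
                 for i in range(t - k + 1) for j in range(t2 - m + 1))
             for m in range(1, t2 + 1)]
            for k in range(1, t + 1)]


def joint_correct(zeta):
    """tau_ij: probability that the conventional rule is right on both items."""
    t, t2 = len(zeta), len(zeta[0])
    p = attempt_probs(zeta)
    total = zeta[t - 1][t2 - 1]                      # q_11: knows both
    for m in range(2, t2 + 1):                       # q_1m: knows the first only
        total += sum(zeta[t - 1][j] / (t2 - j) for j in range(t2 - m + 1))
    for k in range(2, t + 1):                        # q_k1: knows the second only
        total += sum(zeta[i][t2 - 1] / (t - i) for i in range(t - k + 1))
    for k in range(2, t + 1):                        # q_km = p_km otherwise
        for m in range(2, t2 + 1):
            total += p[k - 1][m - 1]
    return total


def correlation(zeta):
    """r_ij from tau_ij and the marginal values tau = 1 - p_2."""
    p = attempt_probs(zeta)
    tau_1 = 1.0 - sum(p[1])                          # first item, second attempt
    tau_2 = 1.0 - sum(row[1] for row in p)           # second item, second attempt
    tau_12 = joint_correct(zeta)
    denom = math.sqrt(tau_1 * tau_2 * (1 - tau_1) * (1 - tau_2))
    return (tau_12 - tau_1 * tau_2) / denom


# Hypothetical zeta_ij for two three-alternative items (rows: first item).
example = [[0.10, 0.05, 0.05],
           [0.10, 0.10, 0.10],
           [0.05, 0.15, 0.30]]
print(joint_correct(example), correlation(example))
```

When the $\zeta_{ij}$ factor into a product of marginal proportions, $\tau_{ij}=\tau_i\tau_j$ and the computed $r_{ij}$ is zero, which provides a simple check.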
Testing Certain Implications of the Model 

For any pair of items, equation (2.6) implies that 

$$p_{11} \ge p_{12} \ge \cdots \ge p_{1t'} \ge p_{2t'} \ge \cdots \ge p_{tt'}, \qquad (2.7a)$$

$$p_{11} \ge p_{21} \ge \cdots \ge p_{t1} \ge p_{t2} \ge \cdots \ge p_{tt'}, \qquad (2.7b)$$

$$p_{i1} \ge p_{i2} \ge \cdots \ge p_{it'} \quad (i=2,\ldots,t-1), \qquad (2.7c)$$

$$p_{1j} \ge p_{2j} \ge \cdots \ge p_{tj} \quad (j=2,\ldots,t'-1), \qquad (2.7d)$$

where as before, t and t' are the number of alternatives for the first 
and second items, respectively. A few other inequalities are implied 
if the $\zeta_{ij}$'s are assumed to be probabilities, but these have not been 
derived. 

Experience with real data suggests that when observed scores are 
consistent with (2.5), the inequalities in (2.7) will also hold. If 
some of the observed proportions are inconsistent with (2.7), maximum 
likelihood estimates can be obtained when the model is assumed to be 
true by applying the minimax order algorithm in Barlow et al. (1972). 

Robertson (1978) includes some asymptotic results on testing (2.7). 
At the moment, however, his proposed procedure cannot be applied because 
certain constants (the $P(\ell,k)$'s in Robertson's notation) are not available. 
An alternative approach is to perform a separate test of the inequalities 
in (2.7d), one corresponding to every j, j=2,...,t'-1, then 
perform a test of (2.7c), one for every i=2,...,t-1, then test (2.7b) 
and finally (2.7a). The total number of tests is m=t+t'-2. If the 
critical level for every test is set at $\alpha/m$, then from the Bonferroni 
inequality (e.g., Tong, 1980), the probability of a Type I error among 
the m tests is at most $\alpha$. 

Consider, for example, the inequalities in (2.7d) for j=2. That 
is, the goal is to test 

$$H_0: p_{12} \ge p_{22} \ge \cdots \ge p_{t2}. \qquad (2.8)$$

Let $\lambda$ be the likelihood ratio for testing (2.8) where the alternative 
hypothesis is no restriction on the proportions. From Robertson (1978, 
Theorem 2), the asymptotic null distribution of $T=-2\ln\lambda$ is 

$$\Pr(T \ge T_0) = \sum_{\ell=1}^{k} P(\ell,k)\, \Pr(\chi^2_{k-\ell} \ge T_0) \qquad (2.9)$$

where $P(\ell,k)$ is the probability that the maximum likelihood estimate of 
$p_{12},\ldots,p_{t2}$ subject to (2.8) will have $\ell$ distinct values among the k parameters 
being estimated, and $\chi^2_{k-\ell}$ is a chi-square random variable with $k-\ell$ 
degrees of freedom. For (2.8), k=t. (As previously mentioned, the pool-adjacent-violators 
algorithm yields maximum likelihood estimates when (2.8) 
is assumed.) The constants $P(\ell,k)$ can be read from Table A.5 in Barlow 
et al. (1972). 

Thus, in order for the m tests to have a critical level of at most $\alpha$, 
choose $T_0$ so that (2.9) equals $\alpha/m$, and reject $H_0$ if $T>T_0$. This process 
is repeated for the other inequalities to be tested, but note that k (the 
number of parameters being tested) will have a different value for (2.7a) 
and (2.7b). 
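
The choice of $T_0$ can be sketched numerically. The Python fragment below assumes that the constants $P(\ell,k)$ are the equal-weights level probabilities, that is, unsigned Stirling numbers of the first kind divided by k!. This is an assumption about the tabled constants rather than something stated here, although with it the computed values agree closely with the 7.24 and 10.81 quoted for k=3 and k=5 when $\alpha/m=.05/4$. The function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Pr(chi-square with df degrees of freedom >= x), integer df >= 0, x > 0."""
    if df == 0:
        return 0.0
    half = x / 2.0
    # Start from df = 1 or 2 and step up two degrees of freedom at a time.
    if df % 2:
        sf, nu = math.erfc(math.sqrt(half)), 1
    else:
        sf, nu = math.exp(-half), 2
    while nu < df:
        sf += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return sf


def level_probs(k):
    """P(l, k), l = 1..k, assuming equal weights (Stirling numbers / k!)."""
    row = [1]                                  # c(0, 0) = 1
    for n in range(1, k + 1):
        new = [0] * (n + 1)
        for l in range(1, n + 1):
            prev_same = row[l] if l < n else 0
            new[l] = row[l - 1] + (n - 1) * prev_same
        row = new
    return [row[l] / math.factorial(k) for l in range(1, k + 1)]


def tail_prob(t0, k):
    """Right-hand side of (2.9)."""
    probs = level_probs(k)
    return sum(probs[l - 1] * chi2_sf(t0, k - l) for l in range(1, k + 1))


def critical_value(k, level, lo=1e-9, hi=200.0):
    """Solve tail_prob(T0, k) = level by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if tail_prob(mid, k) > level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(round(critical_value(3, 0.05 / 4), 2))
print(round(critical_value(5, 0.05 / 4), 2))
```

Bisection is used only because the tail probability is monotone in $T_0$; any root finder would serve.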
To facilitate this procedure, critical values are reported in 
Table 1 for t=2(1)5, $\alpha$=.1, .05, .01, and some appropriately chosen values 
for m. (Additional values for m were not used because for t≤5, these 
are the only values of m that will occur.) 

As an illustration, suppose t=t'=3. Then there are m=4 sets of 
inequalities to be tested. If $\alpha$=.05, then for (2.7a) there are k=5 
parameters, and so $T_0$=10.81. For (2.7b) again k=5 and $T_0$=10.81. For 
(2.7c) there is only one set of inequalities, which corresponds to i=2, 
t=k=3, and $T_0$=7.24. The same is true for (2.7d). 

3. A Scoring Procedure for Tests 

Consider a specific item on an n-item test. In contrast to most of 
the existing scoring procedures, the goal here is to minimize the expected 
number of examinees for whom an incorrect decision is made about whether 
they know the answer to the item. It is interesting to note that when 
items are scored right/wrong, this criterion can rule out the conventional 
rule where it is decided an examinee knows if and only if the correct 
response is given. The extreme case is where $\zeta_{t-1}=0$, i.e., none of 
the examinees know, in which case the optimal rule is to decide that an 
examinee does not know regardless of the response given. If $\beta$=Pr(correct | 
examinee does not know), it can be seen that if an item is scored right/wrong, 
and if $\beta > \zeta_{t-1}/(1-\zeta_{t-1})$, the optimal rule is to always decide that 
examinees do not know. If $\beta < \zeta_{t-1}/(1-\zeta_{t-1})$, use the conventional rule. 
From Copas (1974), this approach (in terms of parameters) is admissible. 
These parameters can be estimated, which yields an estimate of the optimal 
decision rule (e.g., Macready and Dayton, 1977). The goal here is to 
derive a decision rule based on an answer-until-correct scoring procedure. 
The advantage of this new approach is that it is not necessary to assume 
all n items are equivalent as was done in Macready and Dayton. (Two items 
are said to be equivalent if every examinee knows both or neither one.) 
The results in Macready and Dayton (1977) could be extended to the case 
of hierarchically related items by applying results in Dayton and Macready 
(1976), but here the goal is to derive a rule where no particular relationship 
is assumed among the items. However, the situation considered by 
Macready and Dayton (1977) has the advantage of allowing Pr(incorrect 
response | examinee knows) > 0, while here this probability is 
assumed to be zero. 

Consider the ith item on a test taken by a specific examinee, and 
let $w_i=1$ if it is decided the examinee knows; otherwise $w_i=0$. Consider 
the jth item on the test (i≠j) for the purpose of assisting in the decision 
about whether $w_i$ should be 1 or 0. The optimal choice for the second 
item will become evident. It is assumed that items are administered 
according to an answer-until-correct scoring procedure. For a specific 
examinee, let $v_i$ be the number of attempts needed to choose the correct 
response to the ith item. The decision rule to be considered is 

$$w_i(v_j) = \begin{cases} 1, & \text{if } v_i < v_{0i} \text{ and } v_j \le v_{0j} \\ 0, & \text{otherwise} \end{cases} \qquad (3.1)$$

where $v_{0i}$ (=1 or 2) and $v_{0j}$ ($1 \le v_{0j} \le t'$) are constants to be determined. 
Note that when $v_{0i}=2$ and $v_{0j}=1$, the rule is similar to the one in Macready 
and Dayton (1977). Also note that $v_{0j}=t'$ corresponds to the conventional 
decision rule where the information about the jth item plays no role in 
determining whether the examinee knows the ith. It is evident, therefore, 
that in terms of parameters, (3.1) always improves upon the conventional 
approach. The improvement actually achieved will of course vary. If 
$\zeta_{t-1}$ is close to one for every item, $p_k$ will also be close to one under 
a conventional scoring rule, in which case there is little motivation 
for using (3.1). However, when $p_k$ is unacceptably small, (3.1) can increase 
$p_k$ by a substantial amount. 

One problem is choosing the constants $v_{0i}$ and $v_{0j}$. A solution is 
as follows. For a randomly sampled examinee responding to the ith and 
jth items, let $p^{*}_{km}$ be the probability of choosing the correct response 
on the kth attempt of the ith item, the mth attempt of the jth item, and 
making a correct decision under the rule (3.1). The probability of a 
correct decision for a randomly sampled examinee is 

$$p_c = \sum_{k=1}^{t} \sum_{m=1}^{t'} p^{*}_{km} \qquad (3.2)$$

which is a function of $v_{0i}$ and $v_{0j}$. Thus, the obvious choice for $v_{0i}$ 
and $v_{0j}$ is the one that maximizes $p_c$. 

Let 

$$q_{1k} = \sum_{j=0}^{t'-k} \zeta_{t-1,j}/(t'-j) \quad (k=1,\ldots,t')$$

be the probability that an examinee knows the ith item and needs k attempts 
on the jth, and 

$$Q = \sum_{i=2}^{t} \sum_{j=1}^{t'} p_{ij}.$$

For $v_{0i}=2$ and any $v_{0j}$, 

$$p_c = Q + \sum_{k=1}^{v_{0j}} q_{1k} + \sum_{k=v_{0j}+1}^{t'} (p_{1k} - q_{1k}). \qquad (3.3)$$

When $v_{0j}=t'$, the second sum in (3.3) is taken to be zero. As for $v_{0i}=1$, 

$$p_c = Q + \sum_{k=1}^{t'} (p_{1k} - q_{1k}). \qquad (3.4)$$

Thus, to determine the optimal choice for $v_{0i}$ and $v_{0j}$ in (3.1), simply 
evaluate $p_c$ for every possible choice of $v_{0i}$ and $v_{0j}$, and then set $v_{0i}$ 
and $v_{0j}$ equal to the values that maximize $p_c$. Of course, when 
making a decision about the ith item, this process can be repeated over 
the n-1 other items on the test. The item that maximizes $p_c$ is the one 
that should be used when determining whether an examinee knows the ith 
item. 
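
The search over decision rules is small enough to write out. The Python sketch below is an illustration only: it takes hypothetical values of $\zeta_{ij}$ as given, treats $q_{1k}$ as the probability that an examinee knows the ith item and needs k attempts on the jth, and evaluates (3.3) and (3.4) for every choice of $v_{0i}$ and $v_{0j}$. The function names, and the use of `None` for $v_{0j}$ when $v_{0i}=1$, are assumptions made for the example.

```python
def attempt_probs(zeta):
    """Equation (2.6): p[k-1][m-1] from zeta[i][j]."""
    t, t2 = len(zeta), len(zeta[0])
    return [[sum(zeta[i][j] / ((t - i) * (t2 - j))
                 for i in range(t - k + 1) for j in range(t2 - m + 1))
             for m in range(1, t2 + 1)]
            for k in range(1, t + 1)]


def rule_accuracy(zeta):
    """p_c for every choice of (v_0i, v_0j); v_0j is None when v_0i = 1."""
    t, t2 = len(zeta), len(zeta[0])
    p = attempt_probs(zeta)
    # Q: more than one attempt on item i, so "does not know" is always right.
    big_q = sum(p[k][m] for k in range(1, t) for m in range(t2))
    # q1[m-1]: knows item i and needs m attempts on item j.
    q1 = [sum(zeta[t - 1][j] / (t2 - j) for j in range(t2 - m + 1))
          for m in range(1, t2 + 1)]
    acc = {(1, None): big_q + sum(p[0][m] - q1[m] for m in range(t2))}
    for v0j in range(1, t2 + 1):
        acc[(2, v0j)] = (big_q + sum(q1[:v0j])
                         + sum(p[0][m] - q1[m] for m in range(v0j, t2)))
    return acc


# Hypothetical zeta_ij for two three-alternative items (rows: item i).
zeta = [[0.10, 0.05, 0.05],
        [0.10, 0.10, 0.10],
        [0.05, 0.15, 0.30]]
accuracy = rule_accuracy(zeta)
best = max(accuracy, key=accuracy.get)
print(best, accuracy[best])
```

The entry for $v_{0i}=2$, $v_{0j}=t'$ is the conventional rule, so comparing it with the maximum shows how much the second item contributes.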

An Illustration 

As a simple illustration, the optimal rule is estimated for two items 
used in Wilcox (1982a). The observed frequencies are shown in Table 2. Note 
that the observed frequencies already satisfy (2.7a)-(2.7d). For the first 
item the estimate of $\tau_1$, the probability of correctly determining whether a 
randomly sampled examinee knows, is $\hat{\tau}_1$=(236-71)/236=.699. For the second 
item it is $\hat{\tau}_2$=.78. 

Suppose the second item is used to help determine whether an examinee 
knows the first. Let $v_{01}=2$ and $v_{02}=1$. Thus, a correct response must be 
given on the first attempt of both items in order to decide that an examinee 
knows. Note that 

$$Q = 1 - \sum_{j=1}^{4} p_{1j}$$

and so Q is estimated to be .513. 

The easiest way to estimate the $\zeta_{ij}$'s is to start with $p_{tt'}$. 
From (2.6) with t=t'=4, 

$$p_{44} = \zeta_{00}/16$$

and so from Table 2, $\hat{\zeta}_{00}=0$. 

Next consider 

$$p_{43} = \zeta_{01}/12 + \zeta_{00}/16$$

and so $\hat{\zeta}_{01}=.048$. Eventually this process yields estimates of all the $\zeta_{ij}$'s, 
which in turn yields estimates of $q_{1j}$, j=1,...,4. The estimates turn out to 
be $q_{11}=.109$, $q_{12}=.012$, $q_{13}=.012$, and $q_{14}=0$. Thus, the estimate of $p_c$ is 
.513+.109+(.089-.012)+(.042-.012)+(.013-0.0)=.742. 

If instead $v_{01}=1$, so that it is always decided an examinee does not 
know, regardless of the observed response, $p_c$ is just one minus the proportion 
of examinees who know. From (2.4), $\zeta_{t-1}=p_1-p_2$, so the estimate 
of $p_c$ is 1-.187=.813. This is a substantial increase in accuracy over 
the conventional rule. 
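
The marginal quantities in this illustration follow directly from the frequencies in Table 2. The short Python fragment below simply repeats that arithmetic; the variable names are hypothetical, and the last value agrees with the reported .813 up to rounding of the intermediate value .187.

```python
# Table 2 frequencies: rows = attempts on the first item,
# columns = attempts on the second item.
freq = [[81, 21, 10, 3],
        [44, 18, 6, 3],
        [20, 7, 5, 1],
        [10, 6, 1, 0]]
n_total = sum(map(sum, freq))

row_totals = [sum(row) for row in freq]          # first item
col_totals = [sum(col) for col in zip(*freq)]    # second item

tau_1 = 1 - row_totals[1] / n_total              # 1 - p_2 for the first item
tau_2 = 1 - col_totals[1] / n_total              # 1 - p_2 for the second item
q_cap = 1 - row_totals[0] / n_total              # Q = 1 - sum_j p_1j
knows_first = (row_totals[0] - row_totals[1]) / n_total   # p_1 - p_2
p_c_never = 1 - knows_first                      # rule with v_01 = 1

print(round(tau_1, 3), round(tau_2, 3), round(q_cap, 3), round(p_c_never, 3))
```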

Determining $p_k$ Under the New Scoring Procedure 

It is evident that the scoring procedure represented by (3.1) improves 
upon the conventional scoring procedure, but when $v_{0j}$ in (3.1) 
is less than t', the method already described for determining $p_k$ will 
in general be inadequate. The reason is that to determine $p_k$, $\tau_{ij}$ 
(the joint probability of making a correct decision about the ith and 
jth item) must be known. (See Section 2.) But when $v_{0j}<t'$, $\tau_{ij}$ may 
depend on two other items, say items k and m. That is, information on 
the kth item and mth item will be used to determine whether the examinee 
knows the ith and jth items respectively. Hence, (2.6) is no longer adequate 
for determining $p_k$. 

One solution might be to extend (2.6) to include four items. In 
theory the parameters could be estimated under the resulting inequalities 
by applying the minimax order algorithm. However, writing an appropriate 
computer program that is valid for t≤5 will be a relatively involved task. 

Another and perhaps more practical approach might be to restrict the 
decision rule so that if the response to the jth item is used in the decision 
about whether an examinee knows the ith item, then the response 
to the ith will be used in deciding about the jth. An advantage of this 
approach is that it simplifies the process of choosing a decision rule 
by reducing the number of pairs of items that are considered. A second 
advantage is that an approximation of $p_k$ can be made using the results in 
section 2. A disadvantage is that by restricting the class of decision 
rules, the potential increase in $p_k$ (over what it is under a conventional 
scoring rule) is reduced. Perhaps this is not a serious problem; at the 
moment it is impossible to say. 

An approach to choosing a scoring rule might be as follows: First 
estimate $p_k$ under a conventional scoring rule. If it is judged to be too 
small, choose a decision rule from among the rules described in the preceding 
paragraph and then estimate $p_k$ in the manner indicated below. If 
$p_k$ is still too small, choose a decision rule from among the broader class 
of rules described in the preceding subsection. In this case, however, 
an approximation of $p_k$ is no longer available for the reasons just given. 

Suppose that if the jth item is chosen to aid in the decision about 
the ith, then the ith item is used in the decision rule for the jth. 
What is needed in order to approximate $p_k$ is an expression for the joint 
probability of making a correct decision for both items. Accordingly, 
consider any two items, and let $u_1(k,m)=1$ if it is decided that an examinee 
knows the first item when the correct response is given on the 
kth attempt of the first item and the mth attempt of the second; otherwise 
$u_1(k,m)=0$. Similarly, $u_2(k,m)=1$ if it is decided that an examinee 
knows the second item when the correct response is given on the kth 
attempt of the first and the mth attempt of the second; otherwise $u_2(k,m)=0$. 
Let 

$$s_{1i}(k,m) = \begin{cases} 1, & \text{if } u_1(k,m)=1 \text{ and } i=t-1, \text{ or if } u_1(k,m)=0 \text{ and } i<t-1 \\ 0, & \text{otherwise} \end{cases}$$

and 

$$s_{2j}(k,m) = \begin{cases} 1, & \text{if } u_2(k,m)=1 \text{ and } j=t'-1, \text{ or if } u_2(k,m)=0 \text{ and } j<t'-1 \\ 0, & \text{otherwise.} \end{cases}$$

Recall that the probability of getting the correct response on the kth 
attempt of the first item and the mth attempt of the second is given by 
(2.6). From this expression it can be seen that the joint probability 
of k attempts on the first item, m attempts on the second, and a correct 
decision on both items is 

$$\tau_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t'-m} \zeta_{ij}\, s_{1i}(k,m)\, s_{2j}(k,m)/[(t-i)(t'-j)].$$

Thus, for a randomly sampled examinee, the joint probability of a correct 
decision for both items, say items i and j, is 

$$\tau_{ij} = \sum_{k=1}^{t} \sum_{m=1}^{t'} \tau_{km}.$$

The joint probability of a correct decision about the first item, k attempts 
on the first and m attempts on the second is 

$$\tau^{(1)}_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t'-m} \zeta_{ij}\, s_{1i}(k,m)/[(t-i)(t'-j)].$$

The corresponding probability for the second item is 

$$\tau^{(2)}_{km} = \sum_{i=0}^{t-k} \sum_{j=0}^{t'-m} \zeta_{ij}\, s_{2j}(k,m)/[(t-i)(t'-j)].$$

Thus, $\tau_i$, the probability of a correct decision about the ith item on 
a test (using the jth item in (3.1)) for a randomly sampled examinee, is 

$$\tau_i = \sum_{k=1}^{t} \sum_{m=1}^{t'} \tau^{(1)}_{km}.$$

Similarly, for the second item, item j, 

$$\tau_j = \sum_{k=1}^{t} \sum_{m=1}^{t'} \tau^{(2)}_{km}.$$

Hence, $p_k$ can be approximated as described in section 2. 
Concluding Remarks 

Virtually all of the results on the proposed scoring rule have been 
in terms of parameters. These parameters are not known, but they are 
easily estimated. The question arises as to the sampling effects on 
estimating the approximation of $p_k$, and on estimating the optimal decision 
rule for determining whether an examinee knows the correct response. 
In some instances, a large number of examinees will be available, and so 
very accurate estimates of the parameters can be obtained. This is the 
case for certain testing firms where literally thousands of examinees 
take the same test. When the number of examinees is small, however, 
sampling fluctuations need to be taken into account; this problem is 
currently being investigated. 

Another important feature of the proposed scoring rule is that the 
decision about whether an examinee knows an item is a function of the 
responses given by the other examinees. If the goal is to minimize the 
number of examinees for whom an incorrect decision is made, there is no 
problem. However, in some instances, this feature might be objectionable. 
Suppose, for example, an examinee takes a test to determine whether a 
high school diploma will be received. It is possible for an examinee 
to fail because of how other examinees perform on the test even though 
the examinee in question deserves to pass. If this type of error is 
highly objectionable, perhaps the proposed scoring rule should be used 
only in diagnostic situations where the goal is to determine how many 
items an examinee actually knows, or which specific items are not known. 

A technical point that should be mentioned is that a few of the 
estimated $\zeta_{ij}$'s were slightly negative, in which case the estimate was set 
equal to zero. As a result, the estimated $\zeta_{ij}$'s sum to .997 rather than one 
as they should. The problem is that equations (2.7a)-(2.7d) are necessary but not 
sufficient conditions for the model to hold. For example, these inequalities 
do not guarantee that $\zeta_{12}$ will be positive. 

Despite these difficulties, there will be situations where correcting 
for guessing can be important. Some examples were given at the beginning 
of the paper. Even if a conventional scoring procedure is to be 
used in operational versions of a test, it might be important to first 
estimate the effects of guessing using an answer-until-correct scoring 
procedure. 

Many scoring rules have been proposed that are based on various cri- 
teria. If a particular criterion is deemed important, of course the cor- 
responding scoring rule should be considered. The point is that most of 
these rules are not based on the goal of determining how many items an 
examinee knows, or which specific skills an examinee has failed to learn. 
Moreover, typical rules usually ignore guessing or assume guessing is at 
random. Thus, the results reported here might be useful in certain 
situations. 
TABLE 1 

Critical Values $T_0$ for the Bonferroni 
Test of Equations (2.7) 

                        $\alpha$
  k     m       .1      .05      .01
  3     4      5.90     7.24    10.38
  3     5      6.33     7.67    10.81
  3     6      6.68     8.03    11.17
  4     3      7.03     8.49    11.86
  4     4      7.64     9.10    12.46
  4     5      8.11     9.56    12.92
  4     6      8.49     9.95    13.30
  5     4      9.25    10.81    14.36
  5     5      9.75    11.31    14.85
  5     6     10.15    11.71    15.25
  5     7     10.51    12.05    15.58
  5     8     10.81    12.35    15.87
  6     5     11.32    12.96    16.67
  7     6     13.29    15.00    18.85
  8     7     15.18    16.95    20.93
  9     8     17.02    18.84    22.34



Table 2 

OBSERVED FREQUENCIES FOR TWO ITEMS ADMINISTERED UNDER 
AN ANSWER-UNTIL-CORRECT SCORING PROCEDURE 

                               Number of Attempts for the Second Item
                                  1       2       3       4     Total
Number of Attempts       1       81      21      10       3      115
for the First Item       2       44      18       6       3       71
                         3       20       7       5       1       33
                         4       10       6       1       0       17
                         Total  155      52      22       7      236
HOW DO EXAMINEES BEHAVE WHEN TAKING 
MULTIPLE CHOICE TESTS? 



Rand R. Wilcox 



Center for the Study of Evaluation 
University of California, Los Angeles 



Horst (1933) assumed that when examinees respond to a multiple-choice 
test item, they eliminate as many distractors as possible, and 
guess at random from among those that remain. More recently, Wilcox 
(1981) proposed a latent structure model for achievement test items that 
was based on this assumption and which solves various measurement problems. 
(See also Wilcox, 1982a, 1982b.) 

Suppose an item is administered according to an answer-until-correct 
(AUC) scoring procedure. That is, examinees choose a response, and they 
are told whether it is correct. If incorrect they choose another response, 
and this process continues until the correct response is selected. Now 
consider two specific distractors. If Horst's assumption is true, then 
among the examinees choosing these two distractors, the order in which 
they are chosen should be at random. Of course for 3 distractors the 
same conclusion holds, only now there are 6 patterns of responses rather 
than 2. An empirical investigation of this implication is described below. 

As was done on previous tests, the final examination for students 
enrolled in an introductory psychology course was administered according 
to an AUC scoring procedure. For 26 items, examinees were asked to record 
the order in which they chose their responses. Bonus points were given 
to those examinees complying with this request. There were 236 examinees 
who took the first 13 items, and 237 examinees took the remaining 13. 
All items had 4 alternatives. 

For any two distractors, the null hypothesis of random order in responses 
can be tested with the usual sign test. Among the examinees 
choosing all three distractors, the chi-square test given by 

[Insert Equation 1 here] 

was used where $x_i$ is the number of examinees choosing the ith response 
pattern, and N is the number of examinees choosing all three distractors. 
Some exact critical values are given by Katti (1973) and Smith et al. 
(1979), and they were used whenever possible. For larger values of N, 
the adjusted chi-square test was used (Smith et al., 1979). 

For each item, the responses to all pairs of distractors were tabulated. 
For N<5, no test was made because it is impossible to reject the 
null hypothesis at the .1 level. For the first test form, 29 tests were 
made and the hypothesis of random choices was rejected three times at 
the .1 level. For the second test form, 25 tests were performed, and again 
$H_0$ was rejected 3 times. 
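
The remark about N<5 can be verified with a short calculation. The Python sketch below computes the two-sided exact sign test p-value for a pair of distractors. It assumes the usual two-sided form of the test (the text does not say which form was used), and the function name is hypothetical.

```python
from math import comb


def sign_test_p(a, b):
    """Two-sided exact sign test p-value for counts a and b of the two orders."""
    n = a + b
    tail = sum(comb(n, i) for i in range(min(a, b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test_p(0, 4))   # most extreme split with N = 4
print(sign_test_p(0, 5))   # most extreme split with N = 5
```

With N=4 the smallest attainable p-value is .125, so the hypothesis cannot be rejected at the .1 level, whereas with N=5 the most extreme split gives .0625.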

Next, an analysis was performed on those responses where all three 
distractors were chosen. Again no test was made for N<5. The largest 
value for N was 75. 

At the .1 level, $H_0$ was rejected for 5 of the 12 items on the first 
test form, and for the second test form the rejection rate was 3 of 11. 

The question remains as to the relative extent to which responses 
are not at random when $H_0$ is rejected. For the case where all three 
distractors were chosen, this quantity was measured with 

[Insert Equation 2 here] 

where $X^2_{max}$ and $X^2_{min}$ are the maximum and minimum possible values of $X^2$. 
From Smith et al. (1979), $X^2_{max}=5N$, and $X^2_{min}$ is given by Dahiya (1971). 
The quantity w has a value between 0 and 1 inclusive. The closer w is 
to one, the more unequal are the cell probabilities in a multinomial 
distribution. 
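
Equations 1 and 2 are not reproduced in this copy, so the following Python sketch is only a guess at their form. It assumes that Equation 1 is the usual Pearson statistic for six equally likely response patterns, $X^2=6\sum x_i^2/N-N$, and that $w=(X^2-X^2_{min})/(X^2_{max}-X^2_{min})$. Under that assumption the maximum is 5N, as stated above, and the minimum is attained when the counts are as nearly equal as possible. The function names are hypothetical.

```python
def pearson_x2(counts):
    """Pearson chi-square statistic for equal cell probabilities."""
    n, cells = sum(counts), len(counts)
    return cells * sum(x * x for x in counts) / n - n


def w_statistic(counts):
    """Assumed form of w: X^2 rescaled to lie between 0 and 1."""
    n, cells = sum(counts), len(counts)
    x2_max = pearson_x2([n] + [0] * (cells - 1))          # all in one cell
    base, extra = divmod(n, cells)
    x2_min = pearson_x2([base + 1] * extra + [base] * (cells - extra))
    return (pearson_x2(counts) - x2_min) / (x2_max - x2_min)


# Hypothetical counts for the six orderings of three distractors.
print(w_statistic([9, 7, 6, 5, 5, 4]))
```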

Marshall and Olkin (1979) suggest that when measuring inequality, 
a certain class of functions (called Schur functions) should be used. 
Writing Equation 1 as a function of $\sum x_i^2$ (Dahiya, 1971) and noting that 
$\sum x_i^2$ is just Simpson's measure of diversity, it follows from results 
in Marshall and Olkin (1979) that w is a Schur function. 

Note that using the w statistic is similar to using Hays' $\omega^2$ (Hays, 
1973). That is, rejecting the null hypothesis does not indicate the extent 
to which the cell probabilities are unequal. It may be, for example, 
that the cell probabilities are not equal, but that for practical purposes 
they are nearly the same in value. 

For the first test form where $H_0$ was rejected, the w values were 
found to be .074, .183, .286, .167, and .137. For the second test form 
they were .098, .125 and .133. Thus, even when $H_0$ is rejected, Horst's 
assumption appears to be a tolerable approximation of reality in most 
cases. Of course there will probably be items where this assumption is 
grossly inadequate. In this case the measurement procedures proposed 
by Wilcox may be totally inappropriate. 